feat(grafana): add gpu workload history deep-link endpoint (#824) #596

Draft
SekiXu wants to merge 2 commits into develop from feat/824-gpu-workload-history-backend

Conversation

@SekiXu SekiXu commented Jun 29, 2026

What type of PR is this?

Feature — new backend API endpoint (Grafana deep-link for GPU workload history).

Which issue(s) this PR fixes?

Backend track of #824 and #825 (cross-track User Stories; this PR is the backend portion only — not closing the stories).

What this PR does?

Adds GET /api/v1/datacenters/{dataCenter}/grafana/gpuWorkloadHistory/{hostname}, returning the Device dashboard (UID i-device) deep-links for a physical node's GPU Utilization (panel 50) and VRAM (panel 51), so the Frontend can add a "View Workload History" button on the physical GPU table.

Response:

{
  "gpuUtilizationUrl": "https://<vip>/grafana/d/i-device/device?orgId=1&var-GPU_HOST=<host>&from=now-3h&to=now&viewPanel=50",
  "vramUrl":           "https://<vip>/grafana/d/i-device/device?orgId=1&var-GPU_HOST=<host>&from=now-3h&to=now&viewPanel=51",
  "enabled": true
}
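For consumers of this response, here is a minimal sketch of how the body above might map to a Go struct. The Go field names are inferred from the JSON keys and are assumptions; the actual GpuWorkloadHistory struct in definition/v1/grafana may be declared differently. The VIP `vip.example` is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GpuWorkloadHistory mirrors the response body shown above. The Go field
// names are assumptions inferred from the JSON keys; the struct in
// definition/v1/grafana may be declared differently.
type GpuWorkloadHistory struct {
	GpuUtilizationURL string `json:"gpuUtilizationUrl"`
	VramURL           string `json:"vramUrl"`
	Enabled           bool   `json:"enabled"`
}

// decodeHistory parses a response body into the struct.
func decodeHistory(body []byte) (GpuWorkloadHistory, error) {
	var h GpuWorkloadHistory
	err := json.Unmarshal(body, &h)
	return h, err
}

func main() {
	// Sample body with a placeholder VIP and the test hostname from this PR.
	body := []byte(`{
  "gpuUtilizationUrl": "https://vip.example/grafana/d/i-device/device?orgId=1&var-GPU_HOST=sky150&from=now-3h&to=now&viewPanel=50",
  "vramUrl": "https://vip.example/grafana/d/i-device/device?orgId=1&var-GPU_HOST=sky150&from=now-3h&to=now&viewPanel=51",
  "enabled": true
}`)
	h, err := decodeHistory(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(h.Enabled, h.VramURL)
}
```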

Changes (additive, zero breaking change):

  • definition/v1/grafana: new GpuWorkloadHistory struct (two URLs + enabled).
  • handlers/grafana/links.go: genGpuUtilizationHistoryLink / genGpuVramHistoryLink (reuse existing link-gen style + base.DataCenterVip).
  • handlers/grafana/handlers.go: forwardGpuWorkloadHistoryLinks + route.

Design notes:

  • One endpoint returns both URLs (per the spike) → Frontend gets both in one call.
  • Filter variable is var-GPU_HOST (hidden $GPU_HOST), not var-HOST: var-HOST resolves to ipmi_sensor.hostname → blank panels.
  • enabled = nodes.IsExist(hostname) (cluster-wide). GetNodeGpusMap is intentionally avoided: it runs hex_sdk locally and reports the wrong node for remote hostnames.
  • Auth-free (GET /grafana/* is in the API auth-free allowlist), consistent with the other Grafana link endpoints.
  • Contract (see #824/#825): panel ids 50=Util / 51=VRAM and var-GPU_HOST must stay stable.
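The design notes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in handlers/grafana: `deviceLink`, `buildHistory`, and the `nodeExists` callback are hypothetical names, with `nodeExists` standing in for nodes.IsExist and the `vip` argument standing in for base.DataCenterVip. The sketch also fills in both URLs when the node is unknown; the PR only states that enabled is false in that case.

```go
package main

import (
	"fmt"
	"net/url"
)

// Panel ids from the contract in this PR: 50 = GPU Utilization, 51 = VRAM.
const (
	panelGPUUtilization = 50
	panelGPUVram        = 51
)

// GpuWorkloadHistory is an assumed shape for the response struct.
type GpuWorkloadHistory struct {
	GpuUtilizationURL string `json:"gpuUtilizationUrl"`
	VramURL           string `json:"vramUrl"`
	Enabled           bool   `json:"enabled"`
}

// deviceLink builds a deep-link into the Device dashboard (UID i-device).
// It filters by var-GPU_HOST rather than var-HOST, as the design notes require.
func deviceLink(vip, hostname string, panel int) string {
	return fmt.Sprintf(
		"https://%s/grafana/d/i-device/device?orgId=1&var-GPU_HOST=%s&from=now-3h&to=now&viewPanel=%d",
		vip, url.QueryEscape(hostname), panel)
}

// buildHistory returns both links in one value so a caller needs one request.
// nodeExists is a hypothetical stand-in for a cluster-wide existence check.
func buildHistory(vip, hostname string, nodeExists func(string) bool) GpuWorkloadHistory {
	return GpuWorkloadHistory{
		GpuUtilizationURL: deviceLink(vip, hostname, panelGPUUtilization),
		VramURL:           deviceLink(vip, hostname, panelGPUVram),
		Enabled:           nodeExists(hostname),
	}
}

func main() {
	known := map[string]bool{"sky150": true}
	exists := func(h string) bool { return known[h] }

	fmt.Printf("%+v\n", buildHistory("vip.example", "sky150", exists))
	fmt.Printf("%+v\n", buildHistory("vip.example", "ghost", exists))
}
```

Keeping the panel ids as named constants puts the contract values (50, 51) in one place, which may make an accidental change easier to spot in review.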

Test results (optional)

1) Make sure the API docs have been updated

✅ Added /grafana/gpuWorkloadHistory/{hostname} to the OpenAPI spec (cube-cos-openapi#103) and bumped the submodule pointer in this PR.

2) Make sure the API works properly

  • Built in the x86 build container (Go 1.24.2); the resulting binary is ELF x86-64.
  • Deployed to test node sky150 and smoke-verified:
    • valid node → gpuUtilizationUrl (viewPanel=50) + vramUrl (viewPanel=51) + enabled:true
    • non-existent hostname → enabled:false
    • hostname consistency confirmed: gpu.host.host = Node.Hostname = sky150
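The smoke checks above could be automated along these lines. This is a hedged sketch, not part of the PR: `checkLink` is a hypothetical helper that parses a returned deep-link and verifies the filter variable and panel id named in the contract.

```go
package main

import (
	"fmt"
	"net/url"
)

// checkLink verifies that a deep-link filters by var-GPU_HOST for the given
// hostname, targets the expected panel, and does not carry var-HOST.
func checkLink(raw, hostname, panel string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse %q: %w", raw, err)
	}
	q := u.Query()
	if got := q.Get("var-GPU_HOST"); got != hostname {
		return fmt.Errorf("var-GPU_HOST = %q, want %q", got, hostname)
	}
	if got := q.Get("viewPanel"); got != panel {
		return fmt.Errorf("viewPanel = %q, want %q", got, panel)
	}
	if q.Has("var-HOST") {
		return fmt.Errorf("unexpected var-HOST parameter")
	}
	return nil
}

func main() {
	// Placeholder VIP; hostname is the test node named in this PR.
	link := "https://vip.example/grafana/d/i-device/device?orgId=1&var-GPU_HOST=sky150&from=now-3h&to=now&viewPanel=50"
	if err := checkLink(link, "sky150", "50"); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```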

🤖 Generated with Claude Code

@SekiXu SekiXu force-pushed the feat/824-gpu-workload-history-backend branch from 5ca75cd to 8863be5 Compare June 29, 2026 09:56
@SekiXu SekiXu self-assigned this Jun 29, 2026
@SekiXu SekiXu marked this pull request as draft June 30, 2026 02:06
…824)

GPU host panels live in the Device dashboard (UID i-device), so expose
two device-scoped Grafana deep-link endpoints:
  GET /api/v1/datacenters/{dataCenter}/grafana/devices/{hostname}/gpuUtilization (panel 50)
  GET /api/v1/datacenters/{dataCenter}/grafana/devices/{hostname}/gpuVram        (panel 51)

Each returns the existing grafana.Dashboard{link, enabled}. Links filter
by var-GPU_HOST (not var-HOST). enabled = nodes.IsExist(hostname)
(cluster-wide); GetNodeGpusMap is avoided as it is local-only and would
report the wrong node for remote hostnames. Auth-free like the other
grafana link endpoints.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@SekiXu SekiXu force-pushed the feat/824-gpu-workload-history-backend branch from 40aa26e to af7272b Compare June 30, 2026 07:23
…ocs (#824)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@SekiXu SekiXu force-pushed the feat/824-gpu-workload-history-backend branch from af7272b to 877ebb6 Compare June 30, 2026 07:25